Pre-processing of Bilingual Corpora for Mandarin-English EBMT
Abstract
Pre-processing of bilingual corpora plays an important role in Example-Based Machine Translation (EBMT) and Statistical-Based Machine Translation (SBMT). For our Mandarin-English EBMT system, pre-processing includes segmentation for Mandarin, bracketing for English, and building a statistical dictionary from the corpora. We used the Mandarin segmenter from the Linguistic Data Consortium (LDC), which uses dynamic programming with a frequency dictionary to segment the text. Although the frequency dictionary is large, it does not completely cover the corpora. In this paper, we describe our work to improve the segmentation of Mandarin and the bracketing of English, the latter aimed at increasing the length of English phrases. A statistical dictionary is built from the aligned bilingual corpus and fed back into segmentation and bracketing to re-segment and re-bracket the corpus. The process is iterated several times to achieve better results. The final outputs of the corpus pre-processing are a segmented and bracketed aligned bilingual corpus and a statistical dictionary. We achieved positive results: the average term length increased by about 60% for Chinese and about 10% for English, and the coverage of the statistical dictionary increased by about 30%.
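To make the segmentation step concrete, here is a minimal sketch of dictionary-based segmentation by dynamic programming. It is not the LDC segmenter: the scoring rule (sum of log relative frequencies), the `max_len` limit, the penalty for characters missing from the dictionary, and the `toy_freq` data are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy frequency dictionary; real dictionaries are far larger.
toy_freq = {"中国": 50, "人民": 40, "中": 10, "国": 10, "人": 20, "民": 5, "国人": 2}

def segment(text, freq, max_len=4):
    """Return the word sequence with the highest total log relative frequency."""
    total = sum(freq.values())
    # Assumed penalty for a single character that the dictionary does not cover.
    unknown = math.log(0.5 / total)
    # best[i] = (score of the best segmentation of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                score = math.log(freq[word] / total)
            elif end - start == 1:
                score = unknown  # fall back to a one-character word
            else:
                continue  # unknown multi-character strings are not candidates
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

Calling `segment("中国人民", toy_freq)` prefers the two two-character words over any split into single characters. The one-character fallback is where incomplete dictionary coverage shows up: an uncovered term is broken into individual characters, which is the kind of gap that feedback from a statistical dictionary could help close.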
Similar resources
Noun versus Verb Bias in Mandarin-English Bilingual Pre-School Children
This study investigated the presence of noun or verb bias in 15 Mandarin-English bilingual pre-school children. The naturalistic bilingual child-caregiver interactions were tape-recorded for 30 minutes each time. The study also addressed the relationship between children's language production and the salient positions of the caregivers' language input. The findings show that the bilingual childr...
Building Multiword Expressions Bilingual Lexicons for Domain Adaptation of an Example-Based Machine Translation System
In this paper, we describe a hybrid approach to automatically building bilingual lexicons of Multiword Expressions (MWEs) from parallel corpora. More specifically, we investigate the impact of using a domain-specific bilingual lexicon of MWEs on domain adaptation of an Example-Based Machine Translation (EBMT) system. We conducted experiments on the English-French language pair and two kinds of texts...
Subsentential Translation Memory for Computer Assisted Writing and Translation
This paper describes a database of translation memory, TotalRecall, developed to encourage authentic and idiomatic use in second language writing. TotalRecall is a bilingual concordancer that supports search queries in English or Chinese for relevant sentences and translations. Although initially intended for learners of English as a Foreign Language (EFL) in Taiwan, it is a gold mine of texts in En...
The Effect of Second-Language Experience on Native-Language Processing.
Previous work on bilingual language processing indicates that native-language skills can influence second-language acquisition. The goal of the present work was to examine the influence of second-language experiences on native-language vocabulary and reading skills in two groups of bilingual speakers. English-Spanish and English-Mandarin bilingual adults were tested on vocabulary knowledge and ...
Multilingual Speech Corpora for TTS System Development
In this paper, four speech corpora collected in the Speech Lab of NCTU in recent years are discussed. They include a Mandarin treebank speech corpus, a Min-Nan speech corpus, a Hakka speech corpus, and a Chinese-English mixed speech corpus. Currently, they are used separately to develop a corpus-based Mandarin TTS system, a Min-Nan TTS system, a Hakka TTS system, and a Chinese-English bilingual...